Data-Rich Approaches to English Morphology: From corpora and experiments to theory and back
Abstract
(in alphabetical order of author’s last name)

Measuring the unobservable: quantifying paradigm gaps and morphological retreat
Adam Albright, MIT

Observing the non-occurrence of ‘expected’ forms is a well-known problem for language acquisition and corpus studies (the problem of negative evidence). In morphology, expected forms may be missing for a variety of reasons, including paradigm gaps, selectional restrictions, and the decreasing productivity of an affix. For example, for many speakers of American English, the verb ‘stride’ has a paradigm gap in the past participle (‘He has *stridden/*stroden/*strode’). Synchronically, this fact is reflected in the low token frequency of (etymologically expected) ‘stridden’ relative to other forms (e.g., present ‘stride’, past ‘strode’). However, merely observing low token frequency does not prove that ‘stridden’ is rarer than expected, let alone explain why it is rare.

In this talk, I consider how diachronic data can help document the existence of paradigm gaps and offer insight into how gaps and selectional restrictions arise. I show that we can observe at least two different sources of restrictions: (1) failure to generalize an affix beyond its original domain of application, and (2) ‘retreat’ of an affix to a smaller phonological domain. I argue that both patterns are predicted by a conservative inductive learning model, such as the Minimal Generalization Learner (Albright and Hayes 2003).

The first goal of the talk is empirical: using data from the Corpus of Historical American English (COHA; Davies 2011) and the Google n-grams corpus (Michel et al. 2011), I show that it is possible to observe the gradual erosion of previously attested forms such as ‘stridden’. A basic obstacle to observing underattestation is that we must estimate the expected frequency of affixed forms. In English, the token frequency of a verb’s past participle is related to the verb’s lemma frequency, but it also depends on many additional factors (argument structure, collocations, real-world factors), some of which are difficult to estimate. However, these factors appear to be relatively stable across time, so that the relative frequency of the participle of a given verb generally remains fairly constant. Fig. 1 shows this for the proportion of ‘sung’ (etc.) tokens relative to present/infinitive ‘sing’ in the American English Google n-grams corpus.

Not all verbs show such constancy, however. In particular, many of the verbs in the ‘stride’ class show a gradual decrease in the ratio of participle forms over the past 100 years, shown in Fig. 2 (normalized for lemma frequency). A loglinear model confirms that participles of this class have been waning over time, with the effect falling especially (but not only) on low-frequency verbs like ‘stride’ and ‘strive’.
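The ratio-over-time idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis reported in the talk: the counts are invented toy numbers (not COHA or Google n-grams data), the function names are hypothetical, and a plain least-squares fit on the log of the ratio stands in for the loglinear model, whose exact specification the abstract does not give.

```python
import math


def participle_ratio(participle_count, base_count):
    """Participle tokens relative to present/infinitive tokens."""
    return participle_count / base_count


def log_linear_slope(years, ratios):
    """Least-squares slope of log(ratio) against year.

    A slope near zero means the participle's share is stable;
    a negative slope means it is shrinking over time.
    """
    logs = [math.log(r) for r in ratios]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    return sxy / sxx


# Hypothetical (participle, base) counts per year, invented for illustration.
# They are NOT taken from any corpus.
TOY_COUNTS = {
    "stable": {1900: (200, 1000), 1950: (210, 1050), 2000: (190, 950)},
    "waning": {1900: (80, 1000), 1950: (40, 1000), 2000: (20, 1000)},
}


def slopes(counts):
    """Return the log-ratio trend for each verb in the toy table."""
    result = {}
    for verb, by_year in counts.items():
        years = sorted(by_year)
        ratios = [participle_ratio(*by_year[y]) for y in years]
        result[verb] = log_linear_slope(years, ratios)
    return result


if __name__ == "__main__":
    for verb, slope in slopes(TOY_COUNTS).items():
        print(verb, slope)
```

Dividing by the base-form count is what controls for changes in how often the verb itself is used, so a falling slope reflects the participle specifically rather than the lemma as a whole.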